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ABSTRACT 


The  provision  of  an  efficient  and  accurate  Computational  Fluid  Dynamics  (CFD)  solver 
for  simulating  flows  within  an  urban  complex  on  the  microscale  (with  a  characteristic 
length  scale  extending  to  about  1-2  km)  is  required  to  support  the  development  of  an 
advanced  atmospheric  transport  and  diffusion  multiscale  modeling  system  for  hazard 
assessment  of  chemical,  biological,  radiological  or  nuclear  (CBRN)  agents  released 
within  cities.  The  ultimate  goal  of  this  supporting  effort  is  to  construct  a  virtual  test 
facility  (VTF)  that  could  eventually  be  used  to  develop  guidelines  and  procedures  for 
operating  in  the  urban  complex  after  a  CBRN  incident. 

Details  of  the  numerical  algorithms  used  for  the  “stand-alone”  microscale  CFD  solver 
urbanSTREAM  executing  on  serial  computers  have  been  described  in  [1].  The  objectives 
of  this  report  are  twofold.  The  first  objective  is  to  provide  a  description  of  the  coupling  of 
urbanSTREAM  with  an  “urbanized”  meso-gamma  scale  flow  model  (Global 
Environmental  Multi-scale  Local  Area  Model,  or  GEM  LAM,  developed  by  Canadian 
Meteorological  Centre).  A  “one-way  interaction”  scheme  is  adopted,  which  allows  for  the 
matching  of  velocity  and  turbulence  fields  in  any  overlap  region  in  a  physically  and 
mathematically  consistent  manner  that  preserves  the  physical  conservation  laws,  mutually 
satisfies  mathematical  boundary  conditions,  and  preserves  numerical  accuracy.  The 
second  objective  is  to  provide  implementation  details  about  how  urbanSTREAM  is 
parallelized  for  use  on  PC  clusters  and  the  IBM  massively  parallel  supercomputing 
platform  to  give  a  problem-solving  environment  for  full  3-D  parallel  simulation. 
Validation  of  this  fully-coupled  mesoscale-to-microscale  system  is  achieved  by  detailed 
comparison  of  model  predictions  with  comprehensive,  high-quality  full-scale  urban 
measurements  in  a  real  cityscape  -  the  Joint  Urban  2003  (JU2003)  conducted  in 
Oklahoma  City  in  the  US. 
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1.  INTRODUCTION 


The  objective  of  CRTI  Project  02-0093RD  is  to  develop  and  validate  a  prototype  state-of- 
the-science  multiscale  modeling  system  for  predicting  the  transport  and  dispersion  of 
chemical,  biological,  radiological  or  nuclear  (CBRN)  materials  in  the  urban  environment 
and  beyond.  This  prototype  modeling  system  serves  as  the  basis  of  a  high-fidelity 
predictive  tool  for  scenario  planning,  forensic  and  post-event  analysis,  as  well  as  for 
operational  response.  Incorporating  the  proposed  system’s  capabilities  in  a  government 
operations  center  will  result  in  improved  emergency  preparedness  for  and  management  of 
potential  CBRN  incidents  in  Canadian  cities.  The  tool  can  be  used  for  planning  at  events 
of  national  significance  (e.g.,  G8  summit,  APEC  meeting,  2010  Winter  Olympics). 

The  details  of  the  urban  microscale  modeling  system  developed  for  Component  1  of 
CRTI  Project  02-0093RD,  along  with  its  five  main  modules  (viz.,  urbanGRID, 
urbanSTREAM,  urbanEU,  urbanAEU  and  urbanPOST)  are  described  in  [1],  Results 
shown  in  [1]  were  obtained  with  the  serial  version  of  urbanSTREAM.  In  order  to 
significantly  reduce  the  wall-clock  time  and  render  the  use  of  urbanSTREAM  more 
feasible  for  emergency  response,  it  is  imperative  that  urbanSTREAM  be  implemented  on 
a  parallel  computer  system.  This  report  will  describe  in  detail  how  urbanSTREAM  is 
parallelized  on  a  distributed-memory  computer  system  using  the  Message-Passing 
Interface  (MPI)  library  [2],  in  conjunction  with  the  “domain-decomposition”  approach.  In 
what  follows,  “urbanSTREAM-P”  will  be  used  to  denote  the  parallel  version  of 
urbanSTREAM. 

The  concept  underlying  the  multi-block  algorithm  will  first  be  described  in  Section  2, 
including  changes  to  the  index  system  in  the  DO  loop  and  construction  of  the  “halo  data” 
regions  required  to  be  incorporated  into  urbanSTREAM.  Porting  the  multi-block  version 
of  urbanSTREAM  to  distributed-memory  computer  architectures  using  MPI  will  be 
addressed  in  Section  3.  Validation  of  urbanSTREAM-P,  including  measurement  of  its 
parallel  efficiency  on  various  parallel  systems,  for  two  test  problems  involving  the 
simulation  of  the  disturbed  flow  field  through  an  array  of  obstacles  and  Oklahoma  City 
will  be  discussed  in  Sections  4  and  5,  respectively.  Finally,  Section  6  outlines  conclusions 
and  recommendations  for  future  extensions  to  the  present  study. 


2.  MULTI-BLOCK  ALGORITHM 

The  multi-block  algorithm  [3]  relies,  essentially,  on  sub-dividing  the  solution  domain  into 
an  arbitrary  number  of  contiguous,  non-overlapping  blocks,  each  having  its  own  grid  and 
(if  necessary)  its  own  associated  local  coordinate  system.  Each  grid  is  first  generated 
separately  by  use  of  a  suitable  grid-generation  procedure,  the  only  constraint  being 
continuity  in  grid-line  positions  across  the  block  boundaries.  Each  block,  viewed  in 
isolation,  is  then  surrounded  by  an  ‘auxiliary’  layer  of  two  cells  originating  from 
neighboring  blocks.  In  reality,  the  block  is  made  to  penetrate  into  its  neighbors  to  the 
extent  of  two  cells  in  order  to  accommodate  ‘halo  data’,  which  are  needed  for  the  solution 
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within  the  block  in  question.  The  choice  of  a  two-cell  penetration  is  linked  to  the  nature  of 
the  higher-order  convection  scheme  and  the  Rhie  &  Chow  interpolation  practice  [4]: 


Ue  =—(UP+UE ) 
2 


+\[^(Pe-pJ  +  Dl(pee-pe)-(Dj+Dl)(pE-pp) 


(1) 


where,  e.g.,  D"  =  Ap  /  ap  (see  Eqs.  (40),  (45)  in  [1]). 


Although  the  coordinate  systems  of  neighboring  blocks  can  be  quite  different  in 
orientation,  as  exemplified  in  Figure  1,  all  geometric  data  pertaining  to  the  auxiliaiy  layer 
attached  to  the  parent  block,  including  the  metric  tensors  and  the  Jacobian,  are  treated  in 
terms  of  the  coordinate  system  of  the  parent  block  and  are  stored  as  if  the  layer  were  part 
of  the  block.  This  arrangement  obviates,  with  one  exception  noted  below,  the  need  for  any 
one  block  to  directly  access  the  ‘foreign’  geometric  information  and  mass  fluxes  residing 
in  the  neighboring  blocks  during  the  solution  process  within  the  individual  block. 

Inter-block  connectivity  is  handled  using  a  connectivity  matrix  in  the  form  of  the  2D  array 
MCONEC(BLOCK,FACE),  where  BLOCK  is  the  block  number  being  considered,  FACE 
identifies  the  block  face  (ranging  from  1  to  6,  with  1  denoting  the  eastern  face,  2  the 
western  face,  3  the  northern  face,  4  the  southern  face,  5  the  top  face  and  6  the  bottom 
face),  and  MCONEC  denotes  the  block  sharing  FACE  with  BLOCK.  The  coordinate 
system  relating  to  any  one  block  is  stored  in  the  form  of  COORD(BLOCK,FACE), 
representing  all  possible  coordinate  permutations  in  the  neighboring  block  sharing  the 
face  ‘FACE’.  To  convey  the  basic  idea  without  introducing  a  significant  loss  of 
generality,  a  typical  2D  example  for  COORD  is  illustrated  in  Figure  2.  The  neighboring 
right-hand-side  block  can  take  any  one  of  eight  combinations  of  coordinates,  signified  by 
the  integers  1  to  8.  Another  example  illustrating  the  use  of  both  MCONEC  and  COORD 
is  given  below  by  reference  to  Figures  1  and  2,  where 


and 


MCONEC(3 , 1  )=2, 
MCONEC(3,3)=0, 

COORD  (3,1)=8, 
COORD  (3,3)=0, 


MCONEC(3,2)=0, 
MCONEC(3,4)=l , 

COORD  (3,2)=0, 
COORD  (3,4)=  1. 


In  the  above,  the  value  ‘0’  signifies  that  the  neighboring  block  is  a  physical  (real) 
boundary  of  the  solution  domain. 


By  reference  to  Eq.  (1)  and  the  apE  coefficient  in  the  pressure-correction  equation  for  the 
SIMPLE  algorithm  [5]  (see  Eq.  (49a)  in  [1]): 

<=DUe  (2) 
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where  Du  A  at  cell  face  ‘e’  are  obtained  as  a  centered  average  of  the  values  of  D"  A  at  the 
two  neighboring  nodes  on  either  side  of  the  face  (viz.,  at  nodes  P  and  E),  and  the 
coefficients  in  the  pressure-correction  equation  depend  on  the  ap  value  associated  with 
the  momentum  equations  applied  to  the  cell  over  which  mass  conservation  is  to  be 
satisfied,  as  well  as  two  neighboring  cells  on  either  side  in  any  coordinate  direction.  It  is 
crucial,  therefore,  to  transmit  this  quantity  from  the  neighboring  blocks  into  the  auxiliary 
two-cell  layer  when  solving  the  momentum  equation  in  the  parent  block.  This  transfer  is 
greatly  assisted  by  the  fact  that  ap  is  coordinate-invariant,  i.e.,  independent  of  the  block- 
local  coordinate  system. 

Once  the  coefficients  for  the  transport  and  pressure-correction  equations  have  been 
assembled  for  each  block,  the  resulting  system  of  equations  is  solved  in  a  segregated 
manner  as  illustrated  in  Figure  3.  Each  set  of  equations  pertaining  to  any  one  block  is 
solved  within  an  ‘inner  iteration’  by  Stone’s  SIP  method  [6]  concurrently  with  the 
temporarily  ‘frozen’  block  boundary  conditions  in  the  ‘halo’  region.  Then,  an  update  of 
boundary  conditions  is  effected  through  the  connectivity  matrix  and  the  identifiers  of  the 
coordinate  systems  in  the  neighboring  blocks  in  order  to  establish  the  inter-block 
coupling.  An  ‘outer  iteration’  consists  of  solving  any  one  set  of  equations  over  all  blocks 
and  the  associated  exchange  of  data  across  block  boundaries.  This  sweep  is  arranged  as  a 
Block  Jacobi  method.  A  sample  multi-block  mesh  for  a  flow  over  a  2D  two-element 
airfoil  from  [7]  is  illustrated  in  Figure  4.  The  multi-block  code  was  originally  designed  to 
run  on  a  shared-memory  machine.  Porting  a  multi-block  code  to  a  distributed-memoiy 
system  will  be  discussed  in  the  next  section. 


3.  PORTING  TO  DISTRIBUTED-MEMORY  MACHINE 

The  major  change  between  the  serial  and  parallel  versions  of  urbanSTREAM  lies  in  the 
index  system  in  the  DO  loop,  which  is  illustrated  by  the  following  partial  FORTRAN 
listing  of  subroutine  CALCU  -  a  subroutine  which  solves  the  x-componcnt  of  the  velocity 
in  the  discretized  Navier-Stokes  equation: 

Serial  version: 

DO  K=2 , NKMl 
DO  J=2 ,  NJM1 
DO  1=2 , NIM1 

IF(iand(flag(i, j ,k) ,C_0) /=0)  THEN 
AB (I , J, K) =0 . 0 
AT (I , J, K) =0 . 0 
AS  (I, J,K)=0.0 
AN  (I ,  J,  K)  =  0 . 0 
AW (I, J,K) =0.0 
AE(I, J,K) =0.0 
SU(I, J,K) =0.0 
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SP(I, J,K)=0.0 
CYCLE 
END  IF 


SU(I,  J,K)=SU(I,  J,K) 

1 +DU1 ( I , J , K) * ( PRESW- PRESE ) 

1+DU2 (I, J,K) * ( PRESS -PRESN) 

1+DU3 (I, J,K) * (PRESB-PREST) 

l+( (AF+BT) *UO(I, J,K) -BT*UM ( I ,  J ,  K) ) *VOL ( I , J, K) 
1/DELT 

END  DO 
END  DO 
END  DO 

Parallel  version: 


NB=1 


DO  K=NKBEG(NB) +2 , NKEND (NB) -2 
DO  J=NJBEG  (NB)  +2  ,  NJEND  (NB)  -  2 
DO  I=NIBEG (NB) +2 , NIEND (NB) -2 

IF ( iand ( f lag ( i , j , k) ,C_0) /=0)  THEN 
AB ( I , J , K) =0 . 0 
AT ( I , J, K) =0 . 0 
AS (I , J, K) =0 . 0 
AN (I , J, K) =0 . 0 
AW ( I , J, K) =0 . 0 
AE (I , J, K) =0 . 0 
SU(I, J,K)  =0 . 0 
SP(I, J,K)=0.0 
CYCLE 
END  IF 


SU(I,  J,K)=SU(I,  J,K) 

1+DU1 (I, J,K) * (PRESW- PRESE) 

1+DU2 (I, J,K) * (PRESS -PRESN) 

1+DU3 (I, J,K) * (PRESB-PREST) 

l+( (AF+BT) *UO(I, J,K) -BT*UM ( I , J, K) ) *VOL(I, J,K) 

1/DELT 

END  DO 
END  DO 
END  DO 

A  data  file  containing  coordinate  information  of  a  computational  mesh  generated  by 
urbanGRID,  called  “grid_urbanSTREAM.dat”,  is  read  first  into  the  grid  partitioning 
program,  called  “gridgn_inflow90”,  which  automatically  partitions  the  single  grid  into 
NBLOC  partitions  of  approximately  equal  sizes  in  order  to  achieve  load-balancing  (see 
Figure  5).  Here,  NBLOC  =  I  MAX  x  JMAX  x  K_MAX,  in  which  I  MAX,  JMAX  and 
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K  MAX  represent  the  number  of  partitions  in  the  I-,  J-  and  K-direction  (in  grid 
coordinates),  respectively.  The  user  needs  to  specify  I  MAX,  J  MAX  and  K  MAX  in 
“partition.dat”  in  order  to  determine  NBLOC,  which  is  the  same  as  the  number  of  central 
processing  units  (CPUs)  required  to  run  urbanSTREAM-P.  An  example  with  I_MAX=2, 
J_MAX=2  and  K_MAX=2  is  illustrated  in  Figure  6  and  one-eighth  of  the  computational 
mesh  is  shown  in  Figure  7,  in  which  the  “halo  data”  regions  are  highlighted  (see  also 
Figure  8).  Swapping  of  information  in  the  halo  data  regions  is  achieved  by  calling  the 
OVELQ  subroutine.  A  partial  FORTRAN  listing  of  the  subroutine  OVELQ  including 
message-passing  interface  MPI  [2]  routines  on  the  eastern  and  western  boundaries  is 
given  below: 

c 

C  East 

C 

IF (MCONEC (node+1, 1) .NE.O)  THEN 
C 

C  packing  message 
C 

DO  L=1 , 2  *NY*NZ 
QSDE(L) =0. 

END  DO 
C 

L=0 

DO  K=NKBEG ( 1) +2 , NKEND ( 1) -2 
DO  J=NJBEG (1 ) +2 , NJEND ( 1 ) -2 
L=L+1 

QSDE (L) =Q (NIEND (1) -2,J,K) 

END  DO 
END  DO 

DO  K=NKBEG (1) +2 , NKEND (1) -2 
DO  J =NJBEG ( 1 ) +  2 , NJEND ( 1 ) - 2 
L=L+1 

QSDE (L)=Q (NIEND (1) -3,J,K) 

END  DO 
END  DO 
C 

CALL  MPI_SEMD (QSDE, 2*NY*NZ , MPI_REAL8 , 

&MCONEC (node+1, 1) -l,MSGPTR(node+l) +l,MPI_COMM_WORLD, ierr) 

i 

CALL  MPI_RECV (QREE , 2*NY*NZ , MPI_REAL8 , 

&MCONEC (node+1, 1) -1,MSGPTR (MCONEC (node+1, 1) ) +2,MPI_COMM_WORLD, 
&status, ierr) 

C 

C  unpacking  message 
C 


L=0 

DO  K=NKBEG ( 1 ) +2 , NKEND ( 1) -2 
DO  J=NJBEG ( 1 ) +2 , NJEND ( 1) -2 
L=L+1 

Q (NIEND (1) -1, J,K) =QREE (L) 
END  DO 
END  DO 

DO  K=NKBEG(1) +2, NKEND (1) -2 
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o  o 


c 


DO  J=NJBEG (1) +2 ,NJEND (1) -2 
L=L+1 


Q(NIEND(1)  ,J,K)=QREE(L) 
END  DO 
END  DO 


END  IF 


West 

C 

IF (MCONEC (node+1,  2) .NE.O)  THEN 
CALL  MPI_RECV (QREW, 2*NY*NZ , MPI_REAL8 , 

&MCONEC (node+1, 2) -1,MSGPTR (MCONEC (node+1, 2) ) + 1 ,  MP I_COMM_WORLD , 
&status, ierr) 

C 

C  unpacking  message 
C 

L=0 

DO  K=NKBEG ( 1) +2 , NKEND { 1) -2 
DO  J =N JBEG ( 1 ) +2 , N JEND ( 1 ) - 2 
L=L+1 

Q(NIBEG(1) +1, J,K) =QREW(L) 

END  DO 
END  DO 

DO  K=NKBEG ( 1 ) +2 , NKEND ( 1) -2 
DO  J=NJBEG ( 1 ) +2 , NJEND ( 1) -2 
L=L+1 

Q(NIBEG(1)  , J, K) =QREW (L) 

END  DO 
END  DO 
C 

C  packing  message 
C 

DO  L=1,2*NY*NZ 
QSDW(L) =0 . 

END  DO 
C 

L=0 

DO  K=NKBEG ( 1) +2 , NKEND ( 1) -2 
DO  J=NJBEG (1) +2 , NJEND (1) -2 
L=L+1 

QSDW (L) =Q (NIBEG ( 1 ) +2 , J ,  K) 

END  DO 
END  DO 

DO  K=NKBEG ( 1 ) +2 , NKEND ( 1 ) - 2 
DO  J=NJBEG(1) +2, NJEND (1) -2 
L=L+1 

QSDW (L) =Q (NIBEG (1) +3, J,K) 

END  DO 
END  DO 
C 

CALL  MPI_SEND (QSDW, 2 *NY*NZ , MPI_REAL8 , 

&MCONEC (node+1 , 2 ) - 1 , MSGPTR (node+1) +2 , MPI_COMM_WORLD, ierr) 

C 

END  IF 
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There  are  two  MPI  routines  used  in  OVELQ,  namely  MPISEND  and  MPIRECV, 
which  send  messages  to  and  receive  messages  from  the  neighboring  blocks,  respectively 
(see  Figure  9).  The  index  of  the  neighboring  block  (or,  equivalently,  the  node  number  in 
the  parallel  system)  is  identified  by  the  connectivity  matrix  MCONEC  as  explained  in 
Section  2.  The  index  system  used  for  a  block  in  question  and  four  neighboring  halo  data 
regions  in  a  2-D  environment  is  presented  in  Figure  10,  to  facilitate  a  better  understanding 
of  the  above  sample  code.  Note  that  MPI  SEND  is  “blocking”.  This  means  that  it  does 
not  return  until  the  message  data  have  been  stored  away,  so  that  the  sender  is  free  to 
access  and  overwrite  the  send  buffer.  Similarly,  MPI  RECV  is  also  blocking.  It  returns 
only  after  the  receive  buffer  contains  the  newly  received  message.  Message  buffering 
decouples  the  send  and  receive  operations.  A  blocking  send  might  complete  as  soon  as  the 
message  was  buffered,  even  if  no  matching  receive  has  been  recorded  by  the  receiver.  On 
the  other  hand,  message  buffering  can  be  expensive,  as  it  entails  additional  memory-to- 
memory  copying,  and  it  requires  the  allocation  of  memory.  In  an  ill-constructed  program, 
blocking  may  lead  to  a  “deadlock”  situation,  where  all  processes  are  blocked 
(deadlocked),  and  no  progress  occurs.  Such  programs  may  complete  when  sufficient 
buffer  space  is  available,  but  will  fail  on  systems  that  use  less  buffering,  or  when  message 
sizes  are  increased.  Since  any  system  will  run  out  of  buffer  resources  as  message  sizes  are 
increased,  a  “safe  programming  style”  is  adopted  in  the  present  implementation  as 
illustrated  in  the  above  sample  code,  which  succeeds  even  if  no  buffer  space  is  available 
and,  therefore,  is  safe  and  will  always  complete  correctly. 

When  the  line-by-line  TDMA  (Tri-Diagonal  Matrix  Algorithm)  [8]  is  used  in  the 
LISOLV  subroutine,  OVEL  P  is  inserted  after  LISOLV  as  exemplified  in  a  partial  list  of 
FORTRAN  code  in  the  CALCU  subroutine  below: 

DO  N= 1 , NSWPU 
CALL  LISOLV (U) 

CALL  MPI_BARRIER(MPI_COMM_WORLD, ierr) 

CALL  OVEL_Q (U) 

END  DO 
C 

CALL  OVELQ(AP) 

C 

DO  K=NKBEG (NB) +1 , NKEND (NB) - 1 
DO  J=NJBEG (NB) +1 , NJEND (NB) - 1 
DO  I=NIBEG (NB) +1 , NIEND (NB) - 1 

DU1 (I , J, K) =XX ( I , J, K) / (AP (I , J, K) +TINY) 

DU2 (I, J,K) =YX(I# J,K) /(AP(I, J,K) +TINY) 

DU 3 ( I , J, K) =ZX ( I , J, K) / (AP ( I , J, K) +TINY) 

END  DO 
END  DO 
END  DO 

Note  also  that  it  is  important  to  swap  ap  between  halo  data  regions,  as  described  in 

Section  2,  in  order  to  ensure  that  mass  fluxes  at  block  boundaries,  evaluated  using  Rhie- 
Chow  interpolation  [4],  are  implemented  correctly. 
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4.  APPLICATION  TO  AN  ARRAY  OF  OBSTACLES 


The  geometry  of  the  obstacle  array  is  exhibited  in  Figure  11.  Meinders  and  Hanjalic  [9] 
made  detailed  measurements  of  the  mean  flow  and  turbulence  energy  using  a  two- 
component  laser  Doppler  anemometer  within  this  array  of  obstacles.  To  simulate  the  flow 
in  this  array,  we  used  a  computational  domain  consisting  of  a  sub-channel  unit  of  4Hx 
3.4H*4H  as  shown  in  Figure  11.  Calculations  were  performed  on  a  mesh  of  85x45x45 
grid  lines  in  x-,  y-  and  z-directions,  respectively,  using  the  standard  k-e  turbulence  model. 
These  directions  correspond,  respectively,  to  the  streamwise,  vertical,  and  spanwise 
directions  in  the  plane  channel.  The  turbulence  kinetic  energy  (TKE)  contours,  which  are 
continuous  across  block  boundaries,  verify  that  the  message-passing  routines 
(MPI  SEND  and  MPI  RECV)  in  OVELP  are  implemented  correctly.  Figure  12  shows  a 
comparison  between  predicted  and  measured  streamwise  velocity  and  TKE  profiles  at  two 
stations,  x/H=  -0.3  and  0.3.  Agreement  in  terms  of  streamwise  velocity  is  generally  fairly 
good.  However,  URANS  tends  to  under-predict  TKE  within  the  street  canyon.  It  is  also 
observed  in  Figure  12  that  sensitivities  of  the  solutions  to  two  different  time-stepping 
schemes  are  very  weak.  The  parallel  efficiency,  measured  on  the  Nexus  and  Australis 
computer  systems  using  up  to  32  CPUs  on  Westgrid  fwww.westgrid.cak  is  exhibited  in 
Figure  13.  Westgrid  is  a  grid-enabled  system  for  high  performance  computing  across 
Western  Canada.  As  seen,  the  parallel  efficiency  relative  to  2  CPUs,  defined  as 

Parallel  Efficiency^  l2CPUs  per  lteratlon  (3) 

(n/2)TnCPUs  per  iteration 

where  T  stands  for  CPU  time,  drops  to  81%  as  32  CPUs  are  used. 


5.  APPLICATION  TO  JOINT  URBAN  2003 

A  major  field  experiment,  Joint  Urban  2003  (JU2003)  [10],  was  conducted  in  Oklahoma 
City  from  June  28  to  July  31,  2003  to  collect  meteorological  and  tracer  data  sets  for 
evaluating  dispersion  models  in  urban  areas.  Various  meteorological  instrumentation  and 
tracer  samplers  were  installed  at  various  locations  throughout  and  around  the  city  in  order 
to  track  the  air  movement  of  safe,  non-toxic  tracer  gases.  The  purpose  of  this  study  is  to 
provide  a  better  understanding  of  pollutant  dispersion  in  an  urban  environment,  and  the 
data  can  be  used  to  develop  and  validate  urban  models  for  flow  and  dispersion.  Further 
details  of  the  experiment  can  be  found  at  https://iu2003-dpg.dpg.armv.mil/. 

The  computational  domain  consists  of  an  “inner  region”,  in  which  all  buildings  are 
explicitly  resolved  as  shown  in  Figure  14,  and  an  “outer  region”  in  which  the  effect  of  an 
aggrcgati°n  of  groups  of  buildings  are  treated  as  a  porous  barrier  [11]  with  a  normalized 
drag  coefficient  Cd  =  CdAKref  =  100  (where  Aref  =  645  m  is  the  reference  length  scale).  A 
mesh  of  99x139x69  grid  lines  was  used  to  cover  the  entire  computational  domain  (inner 
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plus  outer  regions),  inside  which  a  mesh  of  55x100x69  grid  lines  was  used  for  the  inner 
region.  The  robustness  of  urbanSTREAM-P  is  demonstrated  by  performing  calculations 
using  power-law  velocity  profiles  as  inflow  conditions  for  the  south-east,  south-west, 
north-east  and  north-west  wind  directions  as  shown  in  Figure  15.  The  parallel  efficiency, 
defined  as 

Parallel  Efficiency  T.cro.  iteration  (4) 

(n)TnCPUs  per  iteration 

which  was  measured  on  the  Narwhal  computer  system  on  Sharcnet  (www.sharcnet.ca) 
using  up  to  32  CPUs,  for  the  south-west  wind  direction  is  exhibited  in  Figure  16.  The 
parallel  efficiency  drops  to  48%  when  32  CPUs  were  used,  suggesting  that  the  “safe”  MPI 
implementation  described  in  Section  3  will  need  fiirther  improvement. 

To  demonstrate  consistency,  vertical  profiles  of  mean  wind  speed  and  wind  direction  at 
35.46474°  N  and  97.51822°  W  using  inflow  conditions  interpolated  from  the  early  results 
of  the  urban  GEM  LAM  model  [12]  for  both  serial  and  parallel  versions  of 
urbanSTREAM  (urbanSTREAM-S  and  urbanSTREAM-P,  respectively)  are  shown  in 
Figure  17.  It  is  encouraging  to  see  from  Figure  17  that  identical  results  are  obtained  using 
the  serial  and  parallel  versions  of  urbanSTREAM.  Since  then,  an  improved  inflow 
condition  from  the  urban  GEM  LAM  model  has  been  provided  by  the  Canadian 
Meteorological  Center  (CMC)  and  is  exhibited  in  Figure  18,  in  which  a  comparison  with 
measurements  for  IOP-9  at  06:30  UTC  on  28  July  2003  is  made  at  35.45295°  N  and 
97.52534°  W,  which  we  refer  to  as  the  PNNL  (Pacific  Northwest  National  Laboratory) 
location  hereafter.  The  PNNL  location  is  approximately  2  km  south  of  the  central 
business  district  of  Oklahoma  City.  Note  that  the  mean  wind  speed  is  reasonably  well 
predicted  except  near  the  ground  (z  <  40  m),  where  the  mean  wind  and  turbulence  were 
obtained  using  an  analytical  1-D  flow  model  [13]  to  extrapolate  the  flow  quantities 
predicted  by  urban  GEM  LAM  above  the  canopy  to  locations  within  the  canopy.  The 
wind  direction,  in  the  range  of  vertical  heights  75  <  z  <  200  m,  is  incorrect.  In  particular, 
measurements  suggest  that  the  mean  wind  direction  is  =176°,  whereas  the  urban  GEM 
LAM  model  predicts  a  wind  direction  of =183°.  The  effect  of  this  minor  difference  in  the 
mean  wind  direction  on  the  dispersion  predictions  will  be  discussed  later. 

Figures  19  and  20  compare  the  predicted  vertical  profiles  of  the  mean  horizontal  wind 
speed  and  wind  direction  with  associated  measured  values  obtained  using  minisodars  at 
35.46474°  N  and  97.51822°  W  (in  the  Botanical  Gardens),  and  at  35.47866°  N  and 
97.51707°  W  (near  10th  Street  and  Harvey  Avenue),  respectively.  The  minisodars  were 
deployed  by  Argonne  National  Laboratory  (ANL).  As  can  be  seen,  the  horizontal  wind 
speeds  in  both  figures  predicted  by  RANS  agree  reasonably  well  with  measurements. 
However,  the  predicted  wind  direction  (=180°)  for  z  <  100  m  in  Figure  19  is  erroneous 
compared  to  the  measured  values,  which  are  =130°  to  180°. 

The  flow  field  statistics  predicted  using  RANS  were  next  used  to  “drive”  an  urban 
dispersion  model,  urbanEU  (see,  e.g.,  [14]  for  details).  The  simulations  were  conducted 
for  the  second  continuous  30-min  release  of  SF6  in  IOP-9,  which  occurred  during  the 
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period  from  06:00-06:30  UTC  on  28  July  2003.  The  dissemination  point  was  located  on 
the  south  side  of  Park  Avenue  at  35.46871°  N  and  97.51556°  W,  with  a  release  height  of 
1.9  m.  The  constant  gas  release  rate  for  this  experiment  was  2.0  g  s'1.  Figure  21  displays 
various  isopleths  (on  a  logarithmic  scale)  of  the  predicted  mean  concentration  field  close 
to  ground  level  obtained  using  urbanEU. 

Figures  21-24  compare  predictions  of  the  mean  concentrations  [in  parts-per-trillion  by 
volume  (pptv)]  obtained  with  (1)  prescribed  wind  profiles  according  to  the  power  laws  at 
the  PNNL  location  and  a  wind  direction  of  =180°,  and  (2)  wind  profiles  and  wind 
directions  interpolated  from  the  urban  GEM  LAM  model  at  various  sampling  locations  in 
Oklahoma  City.  The  experimental  data  shown  here  is  for  a  30-min  averaging  time. 
Generally  speaking,  the  predictions  for  the  mean  concentration  displayed  in  Figure  24, 
identified  as  “urbanEU  (with  PNNL)”,  for  sampling  locations  along  6th  Street  and  along 
the  1-km  sampling  arc  were  quite  good  at  or  near  the  mean  plume  centerline,  with 
predictions  within  a  factor  of  two  of  the  observed  concentration.  In  contrast,  results 
identified  as  “urbanEU  (with  GEM/LAM)”  in  Figure  24  suggested  that  the  predicted 
mean  plume  centerline  was  too  far  east,  which  was  consistent  with  the  predicted  wind 
direction,  which  averaged  approximately  25°  greater  than  the  measured  values  for  0  <  z 
<  100  m  (i.e.,  the  predicted  wind  direction  was  too  far  east),  as  shown  in  Figure  19.  This 
was  likely  caused  by  the  discrepancy  in  the  wind  direction  between  GEM  LAM 
predictions  and  the  experimental  data  observed  at  the  inflow  plane  as  seen  in  Figure  1 8 
earlier. 


6.  CONCLUSIONS 

Details  of  the  parallelization  of  urbanSTREAM  and  the  testing  of  its  performance  on 
various  parallel  computer  systems  have  been  addressed  in  this  report.  The  major  steps 
involved  in  converting  the  serial  version  of  urbanSTREAM  into  its  parallel  version  - 
termed  urbanSTREAM-P,  are  (1)  the  employment  of  the  “domain-decomposition” 
strategy  (or,  equivalently,  the  multi-block  approach)  so  that  block  connectivity 
information  and  “halo  data”  regions  can  be  constructed,  and  (2)  explicit  passing  of 
messages  stored  in  the  halo  data  regions  among  different  nodes  using  the  MPI  library  to 
establish  a  coupling  of  the  solutions  in  each  sub-domain,  which  collectively  form  the 
original,  entire  computational  domain.  In  addition,  changes  to  the  index  system  in  the  DO 
loop  are  also  necessary,  which  will  facilitate  the  future  implementation  of  the  fully- 
coupled  multigrid  (MG)  scheme  and  the  local-grid-refinement  (LGR)  method  to  be 
addressed  later. 

One-way  coupling  between  urbanSTREAM  and  the  urban  GEM  LAM  model  is 
implemented  in  the  system  and  validated  against  the  JU2003  database  for  the  IOP9  test 
problem.  The  flow  data,  including  3-D  flow  field  and  turbulence  kinetic  energy  (TKE) 
profiles,  provided  by  urban  GEM  LAM,  are  first  interpolated  by  urbanGRID  and  used  as 
inflow  conditions  to  “drive”  urbanSTREAM.  urbanSTREAM,  in  turn,  generates  time- 
averaged  velocity  and  turbulence  fields  to  “drive”  urbanEU  to  generate  concentration 
fields.  The  test  calculations  permit  the  following  conclusions  to  be  drawn  and 
recommendations  to  be  made. 
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1.  The  sensitivity  of  concentration  predictions  to  the  inflow  condition,  particularly  in 
terms  of  a  wind  direction  that  is  different  by  a  few  degrees,  was  observed  for  the 
IOP9  test  problem. 

2.  It  is  expected  that  better  concentration  predictions  could  be  obtained  if  urbanEU 
were  to  be  combined  with  urbanSTREAM  (viz.,  adding  an  advection-diffusion 
equation  for  concentration  in  urbanSTREAM)  and  executed  in  a  true  (instead  of 
pseudo)  unsteady  RANS  mode  with  inflow  conditions  from  the  GEM  LAM 
database  being  updated  every  30  minutes. 

3.  The  robustness  and  consistency  of  urbanSTREAM-P  have  been  demonstrated  by 
testing  the  code  with  four  different  wind  direction  scenarios,  namely  north-east, 
north-west,  south-east  and  south-west,  and  comparing  the  results  obtained  on  the 
same  grid  when  using  both  the  serial  and  parallel  versions  of  urbanSTREAM. 

4.  The  wall-clock  time  required  for  urbanSTEAM-P  is  significantly  reduced  when 
more  CPUs  are  used.  Hence,  the  parallel  efficiency  achieved  in  the  present  study 
still  requires  further  improvement.  For  32  CPUs,  an  approximate  parallel 
efficiency  of  50%  was  achieved,  which  is  likely  due  to  the  employment  of  a  “safe” 
programming  style  based  on  “blocking”  send  and  receive  MPI  routines.  It  is 
recommended  that  “non-blocking”  send  and  receive  MPI  routines  be 
implemented,  although  care  must  be  taken  to  avoid  “deadlock”,  particularly  for 
systems  in  which  buffer  space  is  limited. 

5.  To  further  improve  the  computational  efficiency,  in  particular  when  thermal 
stratification  effects  are  important  and  the  coupling  between  velocity  and 
temperature  fields  is  strong,  the  fully-coupled  multigrid  (MG)  method  [15]  as 
illustrated  in  Figure  25,  will  need  to  be  implemented.  Here,  the  primary  task  is  to 
compute  corrections  of  variables  for  a  system  of  transport  equations,  including 
momentum,  pressure-correction,  energy  and  turbulence  equations,  instead  of 
correction  of  a  variable  for  each  transport  equation  separately  on  a  sequence  of 
meshes  of  different  sizes.  The  main  challenge  behind  applying  MG  for  turbulent 
flows  is  that  the  TKE  can  potentially  be  “negative”  (or,  non-realizable)  at  the 
prolongation  (coarse-to-fine  grid  interpolation)  stage  during  MG  cycles. 

6.  urbanGRID  is  designed  for  use  by  a  standard  structured-grid  flow  solver,  such  as 
urbanSTREAM.  In  the  case  of  multiple  source  scenarios  (viz.,  CBRN  agents 
released  in  several  locations  throughout  a  city),  the  capability  of  local-grid- 
refinement  (LGR),  as  exemplified  in  Figure  26,  can  be  beneficial  from  the  point  of 
view  of  both  computational  efficiency  and  solution  accuracy.  LGR  allows 
sufficient  grid  resolution  to  be  provided  in  regions  of  great  interest,  such  as 
clusters  of  buildings  surrounding  the  source.  In  addition,  the  computational  cost, 
which  is  closely  related  to  the  total  number  of  grid  points,  is  significantly  reduced 
compared  to  the  case  where  an  equally-fine  grid  is  used  for  the  entire  solution 
domain.  However,  parallelizing  a  LGR  method  can  be  challenging.  A  (close  to) 
perfect  load  balancing  is  deemed  necessary  in  order  to  minimize  the  total  CPU 
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time  required  on  any  parallel  (distributed-memory  or  shared-memory)  systems.  It 
is  recommended  that  the  Hilbert’s  space-filling-curve  (HSFC)  approach  be  used  as 
demonstrated  in  [16]. 
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Figure  1.  Multi-block  arrangement  with  different  local  coordinate  systems. 
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Figure  2.  All  possible  permutations  of  local  coordinate  systems  in  blocks  adjacent  to 
any  reference  block. 


Figure  3.  Solution  sequence  of  SIMPLE  algorithm  within  a  multi-block  scheme. 
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Figure  4.  A  multi-block  mesh  for  a  2D  two-element  airfoil  171- 


Figure  5.  Interfacing  urbanGRID  and  urbanSTREAM-P. 
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K_max=2  (K) 


Figure  6.  Domain  decomposition  of  a  solution  domain  for  8  CPUs. 


Figure  7.  One-eighth  of  the  computational  mesh,  including  the  “halo  data”  regions. 
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Block  3 


B.C.  from 
neighboring  block 


Figure  8.  Illustration  of  data  swapping  between  two  adjacent  halo  data  regions. 
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Figure  9  Illustration  of  data  swapping  by  using  MPISEND  and  MPI  RECV 
routines  in  halo  data  regions. 


22 


NJEND 


Halo 


NJEND-2 


N.IBF.G+2 


NJBEG 


> 


o: 

io 

1  i~m\ 

X  *  1  X  ✓ 

i  i 

i  NB  i 

i  i 

1 

1 

1 

1 

1 

1 

o\ 

:0 

»**t  !  *"*» 

i  — 
i  •  » 

i  v  * 

*  * 

1 

i 

i  *  ✓ 
i .  .  .  . 

r« 

s  * 

1 

NIBEG  a  a  a  a  NIEND 

NIBEG+1 '  |  |  ' N I  END-2 


NIBEG+2  NIF.ND-2 


1 - 

! 

i  s  / 

l"~\ 

\  * 

o“" 

tT  \ 

oi 

io 

i  i 
i  i 
i  i 
i  i 
i  i 
i  i 

;o 

l  \ 

.-A::. 


41 


I 

m-i 

1 1 

NIEND(I)-2 


NJEND(1)-1 

NJEND(l)-2 


NIEND(l)- 


Figure  10  Index  system  used  for  the  block  in  question  and  four  neighboring  blocks 
in  a  2-D  environment. 
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Top  View  Side  View 

Standard  RANS  k-e  Model 

Figure  11.  Flow  over  a  regular  and  aligned  array  of  obstacles  obtained  with  URANS 
on  a  4-CPU  parallel  system. 
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Figure  12.  Profiles  of  streamwisc  velocity  and  turbulence  kinetic  energy  (TKE)  for  a 
flow  over  a  regular  and  aligned  array  of  obstacles  using  URANS.  Case  1:  2nd-order 
time-stepping;  Case  2:  ls,-order  time  stepping. 
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Parallel  Efficiency  =  ^  .  per  iteration 

(n/2)TnCPUl  per  iteration 


Figure  13.  Parallel  efficiency  measured  on  the  Nexus  and  Australis  computing 
systems  on  Westgrid  (www.westgrid.ca)  for  a  flow  over  a  regular  and  aligned  array 
of  obstacles  using  URANS. 
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Figure  14.  Illustration  of  the  present  drag-force  approach  applied  to  the  JU2003  test 
problem. 
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Figure  15.  Flow  visualization  obtained  with  urbanSTREAM-P  using  power-law 
velocity  profiles  as  inflow  conditions  for  south-east,  south-west,  north-east  and 
north-west  wind  directions  in  Oklahoma  City. 
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Figure  16.  Parallel  efficiency  measured  on  the  Narwhal  computing  system  on 
Sharcnet  (www.sharcnet.ca)  for  a  flow  over  Oklahoma  City  for  south-west  wind 
direction  using  URANS. 


26 


Figure  17.  Vertical  Profiles  of  mean  wind  speed  and  wind  direction  at  35.46474°  N 
and  97.51822°  W  using  inflow  conditions  interpolated  from  the  early  results  of  the 
urban  GEM  LAM  model  for  both  serial  and  parallel  versions  of  urbanSTREAM. 


Figure  18.  Comparison  of  vertical  profiles  of  mean  wind  speed  and  wind  direction 
interpolated  from  GEM  LAM  with  experimental  measurements  at  35.45295°  N  and 
97.52534°  W,  which  is  at  2  km  south  of  the  central  business  district  of  Oklahoma 
City. 


27 


Figure  19.  Comparison  of  vertical  profiles  of  mean  wind  speed  and  wind  direction 
interpolated  from  GEM  LAM  with  experimental  measurements  at  35.46474°  N  and 
97.51822°  W. 


Figure  20.  Comparison  of  vertical  profiles  of  mean  wind  speed  and  wind  direction 
interpolated  from  GEM  LAM  with  experimental  measurements  at  35.47866°  N  and 
97.51707°  W. 
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Figure  21.  Mean  plume  concentration  isopleths  (logarithmic  scale)  at  ground  level 
for  tracer  release  on  south  side  of  Park  Avenue  in  Oklahoma  City. 
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Figure  22.  Comparison  between  predicted  mean  concentrations  obtained  from 
urbanEU  with  flow  fields  provided  by  urbanSTREAM-P  in  standalone  (PNNL)  and 
coupled  (GEM/LAM)  modes,  and  experimental  measurements  obtained  with 
detectors  along  Kerr  Avenue  and  McGee  Avenue. 
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Figure  23.  Comparison  between  predicted  mean  concentrations  obtained  from 
urbanEU  with  flow  fields  provided  by  urbanSTREAM-P  in  standalone  (PNNL)  and 
coupled  (GEM/LAM)  modes,  and  experimental  measurements  obtained  with 
detectors  along  4,h  Street  and  5lh  Street. 
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Figure  24.  Comparison  between  predicted  mean  concentrations  obtained  from 
urbanEU  with  flow  fields  provided  by  urbanSTREAM-P  in  standalone  (PNNL)  and 
coupled  (GEM/LAM)  modes,  and  experimental  measurements  obtained  with 
detectors  along  6,h  Street  and  the  1-kin  sampling  arc. 
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More  efficient;  MG  account  for  coupling 
between  equations 


Figure  25.  Flow  chart  of  fully-coupled  multigrid  method  for  2-D  turbulent  flows. 


Figure  26.  Mesh  generation  for  local-grid-rcfincmcnt  with  application  to  multiple 
source  scenarios. 
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